1 Introduction


In the past decade, major architectural achievements aimed at building fast processing elements
were accomplished with pipelining and parallelization. The current expertise in machine
architecture includes a well-balanced knowledge of the design of highly pipelined processors, and it
appears that the major bottleneck in building fast computers is in providing memory systems
that are capable of matching the speed of these fast processors. This problem has been
extensively studied and is solved, in the case of single-processor architectures, with the use of fast
cache memories [11]. These cache memories are efficient in matching processor speed as long
as a memory access hits in the cache, but do not improve anything when the access misses. In
a conventional sequential architecture the probability of a cache miss can be made arbitrarily
small by building sufficiently large cache memories, and it turns out that in most applications
the speed ratio between the processor and the main memory is such that rather small caches
(4 kilowords) are sufficient to cache up to 4 megawords of memory and to achieve a very high
processor utilization (over 99%).

The  next  step  in  making  faster  machines  is  naturally  the  use  of  several  such  processor/cache 
elements  organized  in  a  way  that  allows  them  to  work  in  parallel.  A  big  class  of  such  parallel 
architectures  is  based  on  a  model  of  computation  that  relies  on  the  concept  of  shared  memory. 
In  this  model,  several  processors  work  concurrently  on  the  same  set  of  data,  using  dedicated 
memory  locations  for  synchronization  purposes.  It  appears  that  the  performance  of  such  systems 
drops  very  rapidly  as  the  number  of  processors  grows  for  three  main  reasons: 

•  The  need  for  synchronization  variables  implies  that  the  corresponding  memory  locations 
cannot  usually  be  cached.  No  matter  how  big  the  caches  are,  and  how  clever  the  cache 
management is, there will always be a minimum number of cache misses due to broadcasting
of synchronization operations.

•  The  fact  that  data  are  shared  means  that  a  piece  of  data  is  very  likely  to  be  requested 
by  different  processing  elements,  and  therefore  increases  the  traffic  between  memory  and 
caches. 

•  The  communication  medium  is  now  also  much  more  critical,  since  a  given  memory  access 
now  needs  to  compete  with  other  processor  requests,  and  therefore  the  communication 
delays  between  caches  and  memory  become  quite  a  bit  higher. 

To  summarize,  cache  misses  cannot  be  rendered  arbitrarily  rare  any  more,  and  in  addition  the 
delays  induced  by  each  cache  miss  are  now  substantially  higher.  To  solve  this  new  problem, 
three  areas  of  computer  architecture  were  extensively  explored  by  the  research  community: 

• The performance of efficient communication media such as crossbar networks was studied
and analyzed [9].

•  Clever  cache  management  schemes,  such  as  the  snoopy  cache  [4],  were  developed  to  reduce 
the  traffic  due  to  shared  data. 

• New processor architectures were developed using techniques such as pipelining and memory
prefetching, or more recently multi-threading [12], to allow the processor to work in
parallel with a memory access, thereby diminishing the idle time due to non-cache accesses.

Interesting  results  have  been  obtained  individually  in  each  of  these  areas;  nevertheless,  when  it 
comes  to  combining  them  into  an  efficient  design  they  appear  to  be  rather  incompatible.  The 
simple and efficient snoopy cache works well only with a bus as a communication medium, while
an  efficient  switching  network  requires  a  rather  complicated  directory-based  cache  management 
scheme  [1]. 

In the Multilisp [5] project, we started with the design of the processor. We found the idea
of  multi-threaded  architecture  particularly  elegant  as  an  alternative  to  conventional  pipelined 
design  for  the  following  two  reasons: 

•  Instead  of  filling  up  the  pipeline  with  instructions  from  the  same  stream,  we  fill  it  with 
instructions  of  different  streams.  In  the  conventional  pipeline,  if  an  instruction  is  delayed 
by  a  long  memory  access,  the  amount  of  work  (instructions)  available  to  the  processor  until 
it  gets  suspended  waiting  for  the  memory  access  to  complete  is  no  greater  than  the  length 
of  the  pipeline.  In  a  multi-threaded  architecture,  this  is  no  longer  true.  The  processor 
finds  available  work  not  only  in  its  pipeline  but  also  by  switching  to  a  different  task,  which 
means  that  the  available  stock  of  potentially  executable  instructions  is  now  bounded  above 
by  the  product  of  the  pipeline’s  length  and  the  number  of  ready-to-run  task  frames.  It 
thereby  achieves  more  parallelism  between  processor  and  memory. 

•  In  a  conventional  design,  these  multiple  threads  would  be  executed  by  several  processors, 
each  one  being  attached  to  its  own  private  cache.  Of  course,  this  avoids  the  problem  of 
having  several  threads  competing  for  the  same  cache  slot  and  for  that  reason  decreases 
the cache miss ratio. On the other hand, the effect of shared data being communicated
between caches tends to increase the cache miss ratio. An intelligent sharing of cache space
according  to  the  needs  of  processes,  combined  with  an  effective  management  policy,  can 
yield  a  higher  hit  ratio  [13].  Furthermore,  even  with  poor  management  of  the  competition 
for  cache  slots,  higher  processor  utilization  can  be  achieved  [6]. 
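
The bound stated in the first point above can be written out as a short calculation. The sketch below is only an illustration: the function names and the example figures (a 5-stage pipeline, 4 ready task frames) are assumptions chosen for the example, not measurements.

```python
def conventional_stock(pipeline_length: int) -> int:
    """Upper bound on the instructions available to a conventional pipeline
    while one long memory access is pending."""
    return pipeline_length


def multithreaded_stock(pipeline_length: int, ready_task_frames: int) -> int:
    """Upper bound when the processor can also switch to other
    ready-to-run task frames (hypothetical helper, for illustration)."""
    return pipeline_length * ready_task_frames


print(conventional_stock(5))      # 5
print(multithreaded_stock(5, 4))  # 20
```

With these example figures the stock of potentially executable instructions grows from 5 to 20, which is the extra overlap between processor and memory that the text refers to.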

Although multi-threaded processors are appealing for the reasons presented above, current
cache expertise does not allow one to actually take advantage of their features. The purpose of
our work was to study suitable caches, carefully isolating the problems that such an architecture
induces. It turned out that the management of such caches requires a fairly complicated
protocol to ensure coherency. In the course of our design research we therefore found the need
for efficient ways of proving the correctness of an implementation. Although this problem is not
new, it is only recently that people have actually tried to specify complicated multi-processor
architectures and deal with correctness proofs. We found that the classical formulations of various
cache coherency properties [8], stated in natural language, were difficult to apply to complicated
designs, particularly because there is no simple, accurate way to describe such designs using
natural language. We therefore introduced a theoretical framework in which one can express in
a unified way all classical coherency properties. We then developed a methodology for proving
statements within this framework, and showed how it is suitable for proving the correctness of
cache coherency protocols.

In the following section we present the various problems that we had to solve for the design of
our cache, and relate them to already known problems. We combine a variety of solutions to
achieve our goal and give an overview of the resulting cache architecture. This is immediately
followed by a section which describes precisely both the architecture and the cache management
protocol. Next we state in what sense our design is correct, and some of the proofs are detailed.
This is followed by the presentation of our formalism along with an explanation of our
motivations. In the last section of this paper, our proof of correctness based on this formalism is
completed.

2 Overview of the cache


The  ability  to  issue  a  memory  request  in  advance,  so  that  the  time  interval  induced  by  the 
memory  delay  can  be  used  by  the  processor  to  make  progress  on  the  execution  of  some  other 
instructions,  is  common  to  both  pipelined  and  multi-threaded  processors.  In  conventional  caches, 
as  soon  as  the  access  generates  a  cache  miss,  the  associated  processor  is  held,  preventing  it  from 
making  any  further  progress  on  other  instructions.  This  is  of  course  rather  unfortunate  for  the 
performance  of  the  system,  but  has  the  advantage  of  yielding  a  quite  simple  cache  design: 

•  There  is  no  need  for  the  cache  to  keep  track  of  outstanding  requests. 

•  There  is  no  need  for  sophisticated  management  to  handle  the  inconsistencies  that  may 
occur  if  a  second  request  modifies  the  location  involved  in  the  outstanding  memory  access. 
All  accesses  are  well  serialized. 

In the past few years, people have studied possible improvements to this situation with the
introduction of buffered memory accesses [3]. Write and read buffers are used in order to make
the cache lock-up-free. It was shown in [3] that buffering is particularly efficient for highly
pipelined machines and for high miss rates. In a second paper [10] the same authors show that
the consistency conditions are not that easy to meet, and conclude that access buffering can be
allowed, but invalidation buffering (in snoopy-cache-type protocols) must be forbidden. They
propose a protocol that relies upon adjoining a new state to each entry, which allows or disables
further accesses: if an entry is being updated, any further reads or writes by the processor must
be rejected or deferred until the update is completed.

In  our  project  we  were  not  only  interested  in  taking  advantage  of  buffering,  but  also  of  a 
better  communication  medium,  and  therefore  investigated  the  consequences  of  using  a  bus  with 
split  transactions.  Such  a  bus  has  several  advantages: 

•  A  bus  transaction  is  composed  of  an  information  transfer  phase  and  a  decoding  phase.  The 
decoding  phase,  which  typically  includes  an  access  to  a  memory  circuit,  is  the  most  time 
consuming operation and does not require the use of the bus itself. In a split-transaction
bus  it  is  possible  to  overlap  the  decoding  phase  with  other  transfers  and  thereby  increase 
the  bandwidth.  This  will  actually  increase  the  performance  by  a  factor  of  at  least  two, 
therefore  allowing  a  larger  number  of  connected  processors. 

•  A  split-transaction  bus  allows  a  pipelined  protocol  and  therefore  allows  pipelining  the 
treatment of bus transactions, making the decoding phase, in particular, non-critical. The
cache  hardware  is  then  simpler  and  more  efficient. 
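
The bandwidth argument in the first point can be checked with an idealized model. The sketch below assumes that every transaction needs `transfer` bus cycles and `decode` off-bus cycles, and that on the split-transaction bus the decoding side can always keep up (for instance because memory is pipelined or interleaved). These assumptions and all names are illustrative and are not taken from the design described here.

```python
def atomic_bus_cycles(transfer: int, decode: int, transactions: int) -> int:
    """Atomic bus: each transaction holds the bus for transfer + decode cycles."""
    return transactions * (transfer + decode)


def split_bus_cycles(transfer: int, decode: int, transactions: int) -> int:
    """Idealized split bus: transfers run back to back and only the last
    decoding phase is not overlapped with another transfer."""
    return transactions * transfer + decode


def asymptotic_speedup(transfer: int, decode: int) -> float:
    """Bandwidth ratio of the two buses for a long stream of transactions."""
    return (transfer + decode) / transfer


print(atomic_bus_cycles(1, 3, 100))  # 400
print(split_bus_cycles(1, 3, 100))   # 103
print(asymptotic_speedup(1, 1))      # 2.0
```

In this model the ratio is at least two whenever the decoding phase takes at least as long as the transfer phase, which matches the remark that decoding is the most time-consuming operation.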

On  the  other  hand,  substantial  complexity  is  added  to  the  cache  coherency  protocol.  Bus 
transactions  are  no  longer  atomic,  and  therefore  the  state  of  a  line  in  a  cache  may  change  during 
the processing of the bus operation. For instance, if a cache C1 owns a dirty copy of the location
X, and cache C2 requests this same piece of data, then C1 must write its copy back. The copy-back
operation is not immediate but takes some cycles, during which any further requests to
this same location traveling on the bus must be intercepted until the write-back is completed.
This  requires  upgrading  the  entry  state  information  to  allow  the  cache  automaton  to  behave 
properly.  We  have  attempted  to  exhaust  most  possible  variations  on  the  cache  protocol  using  a 
split  transactions  bus.  It  opens  the  door  for  many  small  optimizations  that  we  carefully  studied, 
reaching  the  conclusion  that  most  of  them  were  not  worth  the  added  complexity.  The  next 
subsection  gives  an  overview  of  the  cache  architecture,  and  is  followed  by  a  paragraph  that  gives 
a  precise  description  of  our  cache  protocol,  along  with  a  list  of  the  acceptable  optimizations. 

2.1 The architecture of the cache

A  memory  access  issued  by  the  processor  is  composed  of  an  operation  (Read,  Write),  an  address, 
and  possibly  a  piece  of  data.  This  is  sent  in  one  cycle  to  the  cache  as  a  token,  and  the  processor 
then  moves  its  attention  to  another  instruction.  This  memory  token  may  either  hit  in  the  cache, 
in  which  case  an  acknowledgment  token  is  sent  to  the  processor,  or  may  miss,  in  which  case 
a  token  is  sent  toward  the  memory  via  the  bus.  This  token  eventually  comes  back  from  the 
memory,  appears  on  the  bus,  and  is  caught  by  the  corresponding  cache.  It  is  then  handled  and 
a  new  token  is  produced  and  sent  to  the  processor  as  an  acknowledgment. 

To  accommodate  this  behavior  we  have  developed  a  cache  which,  seen  from  the  outside,  has 
two  input  ports  and  two  output  ports  (one  pair  on  the  processor  side,  and  the  other  pair  on 
the  bus  side)  and  a  small  memory  array.  Attached  to  the  bus  output  port,  a  fifo  queue  allows 
tokens  to  be  sent  out  without  worrying  about  a  possibly  busy  bus.  Another  small  fifo  buffer  is 
put  on  the  processor  input  path  to  accommodate  the  situation  where,  at  a  given  cycle,  a  bus 
acknowledgment  token  arrives  while  the  cache  is  responding  to  a  request  that  hits  in  the  cache 
memory. Each memory address X maps to a cache entry, which records the address
and data cached, as well as the current state of the cache entry and some additional bits, e.g.,
to help with synchronization or garbage collection. In the rest of the paper no assumption will
be made about the memory-to-cache mapping; the cache can be a one-way associative cache as
well as a many-way associative cache. Nevertheless we will use as a convention that X̃ represents
any address that maps onto the same entry as X, and X̄ is any such address that will require
a cache replacement (in a one-way associative cache, X̄ = X̃ - {X}). The architecture and the
operation of the cache itself can be described in terms of the following three modules:

•  A  local  part  in  charge  of  processing  requests  performed  by  the  processor,  essentially  by 
doing  a  cache  lookup.  As  a  result  of  this  processing,  this  module  may  either  send  a 
request to the cache manager or simply send a token to the fifo queue attached to the bus
output  port  in  order  to  perform  a  memory  access. 

•  A  global  part  which  constantly  watches  the  bus  for  acknowledgments  or  for  requests  from 
other  processors.  Each  bus  transaction  therefore  requires  a  cache  lookup.  This  module 
may  then  either  send  a  token  directly  onto  the  bus  (bypassing  the  fifo  buffer)  along  with 
the  activation  of  the  Kill  signal  (this  happens  when  it  detects  a  request  to  a  datum  that 
wasn’t  yet  written  back  to  the  memory),  or  send  a  request  to  the  cache  manager  to  update 
an  entry.  This  module  may  also  activate  a  Shared  signal,  to  notify  the  other  processors 
that  the  data  being  acknowledged  on  the  bus  is  shared. 

• A cache manager which gets requests from both the local part and the global part. It performs
a local operation on the cache array (Read, Write, State change) and if needed sends
tokens to the local part or the global part, which then propagate them to the processor
fifo or to the bus fifo.

At  any  given  clock  cycle,  the  cache  can  get  a  request  from  both  the  processor  and  from  the  bus. 
In  fact,  at  each  active  bus  cycle  the  global  part  of  the  cache  (which  constantly  observes  the  bus) 
must  access  the  cache  array  in  order  to  determine  its  action.  Then  it  may  modify  the  cache 
state  (the  previous  lookup  and  this  modification  must  occur  atomically,  for  instance  in  the  same 
cycle).  In  parallel  with  the  treatment  of  bus  transactions,  the  processor  may  read  or  modify  an 
entry  in  the  cache.  These  three  actions  that  potentially  occur  within  the  same  clock  cycle  must 
nevertheless  be  logically  serialized  at  the  cache  array  level  and  are  therefore  processed  in  the 
following  order:  First  the  cache  entry  is  read  by  the  snoopy  unit,  then  a  possible  modification  is 
reported  by  the  same  unit,  finally  the  processor  request  is  handled.  The  actual  hardware  may 
of  course  perform  all  this  in  parallel  as  long  as  the  logical  sequence  is  preserved. 
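
The logical ordering of the three accesses within one clock cycle can be pictured with a toy model. In the sketch below the cache array is a dictionary from entry index to a (state, data) pair; `run_cycle` and its argument layout are hypothetical names introduced for the example, and the processor write is deliberately naive (it ignores the coherency protocol).

```python
from typing import Callable, Dict, Optional, Tuple

Entry = Tuple[str, int]  # (state, data); state names follow the paper's abbreviations


def run_cycle(array: Dict[int, Entry],
              snoop: Optional[Tuple[int, Callable[[Entry], Entry]]],
              request: Optional[Tuple[str, int, Optional[int]]]) -> Optional[Entry]:
    """Apply one cycle's accesses in the logical order described in the text."""
    reply = None
    if snoop is not None:
        index, modify = snoop
        seen = array[index]          # 1. lookup by the snoopy unit
        array[index] = modify(seen)  # 2. its modification, atomic with the lookup
    if request is not None:
        op, index, value = request
        state, data = array[index]   # 3. the processor request is handled last
        if op == "read":
            reply = (state, data)
        else:
            array[index] = (state, value)
    return reply


array = {0: ("PC", 7)}
# A bus event turns the entry from PC to S while the processor reads it.
reply = run_cycle(array, (0, lambda e: ("S", e[1])), ("read", 0, None))
print(reply)  # ('S', 7)
```

Because the processor request is served after the snoop modification, the read in the same cycle already observes the entry in its new state.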

2.2  Description  of  the  cache  protocol 

In  order  to  take  proper  actions  upon  local  part  and  global  part  requests,  the  cache  manager 
must  maintain  some  information  about  the  current  consistency  state  of  each  cache  datum.  The 
conventional  snoopy  cache  protocol  requires  the  three  following  fundamental  states  to  describe 
a  given  cache  entry: 

•  Private  and  clean  state  (PC)  which  means  that  the  data  is  held  locally  and  is  an  exact 
copy  of  the  corresponding  location  in  memory.  The  associated  cache  is  the  only  one  that 
has  a  copy.  The  corresponding  processor  can  therefore  perform  all  the  operations  locally. 

•  Private  and  dirty  (PD)  which  means  that  the  data  is  held  locally,  and  was  modified  by  the 
associated  processor.  The  copy  is  therefore  not  consistent  with  the  one  in  memory.  The 
associated  cache  is  the  only  one  that  caches  this  location.  The  corresponding  processor 
can  still  perform  all  the  operations  locally,  but  the  cache  is  responsible  for  updating  the 
memory  as  soon  as  another  processor  wants  to  read  this  location. 

•  Shared  (S)  which  means  that  the  data  is  held  locally  and  in  some  other  cache  as  well.  The 
copy  is  also  consistent  with  the  corresponding  memory  location.  The  attached  processor 
can  still  perform  locally  the  Read  operations  but  all  Writes  must  be  broadcast. 

The  fact  that  memory  Writes  and  Reads  are  buffered  requires  adjoining  the  following  state: 

• Wait for bus transaction (WB) which means that the cache entry is being written to or
read  from  the  memory,  and  that  a  bus  transaction  was  requested  but  the  memory  access 
has  not  yet  traveled  on  the  bus.  Any  further  request  to  this  cache  entry,  from  the  attached 
processor,  is  rejected  until  the  corresponding  acknowledgment  token  arrives. 

Since  we  allow  optimizing  the  memory  accesses  by  observing  and  using  the  acknowledgment 
tokens  which  appear  on  the  bus  and  match  outstanding  memory  requests  that  are  still  sitting  in 
the  output  buffer,  we  need  the  following  state  that  indicates  whether  an  access  can  be  optimized 
or  not. 

• Wait for a memory acknowledgment (WA) which means that the memory access has traveled
on the bus, but the memory acknowledgment has not yet been observed on the bus. Any
further  request  to  this  cache  entry,  from  the  attached  processor,  is  rejected  and  the  cache 
takes  advantage  of  any  acknowledgment  token  that  appears  on  the  bus  and  matches  the 
address  of  this  outstanding  memory  access. 

The  fact  that  transactions  are  split  implies  that  write-backs  are  neither  atomic  nor  instantaneous, 
therefore  requiring  this  intermediate  state. 

•  Wait  for  update  acknowledgement  (WU)  which  means  that  an  update  access  was  sent  on 
the  bus  but  the  corresponding  memory  acknowledgment  hasn’t  appeared  yet.  Any  further 
request  to  this  entry  that  appears  on  the  bus  must  be  cancelled  (by  activating  the  kill 
signal),  and  any  request  to  this  entry  from  the  attached  processor  is  rejected. 
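
The six states listed so far fall into two groups, which the following minimal sketch makes explicit. The Python names are assumptions made for the example; the rules encoded are only those stated in the state descriptions above (in particular, only the WU description mentions the kill signal, so only WU is listed there).

```python
from enum import Enum


class State(Enum):
    PC = "private and clean"
    PD = "private and dirty"
    S = "shared"
    WB = "wait for bus transaction"
    WA = "wait for memory acknowledgment"
    WU = "wait for update acknowledgment"


STABLE = frozenset({State.PC, State.PD, State.S})
INTERMEDIATE = frozenset({State.WB, State.WA, State.WU})


def rejects_processor_request(state: State) -> bool:
    """Requests from the attached processor are rejected in every waiting state."""
    return state in INTERMEDIATE


def kills_bus_request(state: State) -> bool:
    """Per the WU description, bus requests to the entry are cancelled."""
    return state is State.WU


print(rejects_processor_request(State.WA))  # True
print(kills_bus_request(State.WB))          # False
```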

Finally,  the  obvious  empty  or  invalid  state  (E)  reflects  the  fact  that  at  startup  time  a  cache 
entry  contains  no  valid  data.  Since  our  protocol  uses  no  invalidations,  the  (E)  state  is  never 
reentered.  The  above-listed  states  determine  the  actions  to  be  taken  by  the  cache  manager  upon 
occurrence  of  processor  or  bus  events.  The  complete  behavior  of  the  cache  is  therefore  fully 
specified  by  the  description  of  its  responses  to  each  event  or  token.  A  token  may  induce  either 
a  state  change  (see  figure  2),  or  an  action  such  as  the  creation  of  a  new  token  (see  table).  There 
are  mainly  two  kinds  of  tokens: 

• Requests from the attached processor, which are Reads or Writes to a given location, will
be denoted by Pi : R@X or Pi : W@X, which stands for processor Pi (by convention Pi
means the attached processor, Pj stands for any processor but the attached one, and Pk
stands for any processor) reading (respectively writing) at (to) location X. As a convention,
X will represent the address that is cached, X̄ any address that causes a replacement of
X's cache entry, and X̃ stands for any address that maps to the same cache entry as X.

• Tokens that travel on the bus, which can be either requests from another processor (Pj :
R,W@X) or write-back requests (updates) from another processor (Pj : U@X) or, more
likely, acknowledgment tokens Pk : Ra,Wa,Ua@X, representing read, write, and update
acknowledgment tokens respectively. This notation is extended to reflect whether the
Shared signal is activated or not upon the observation of the acknowledgment token. This is
done by adding a (shared) or “.p” (private = not shared) modifier at the end of the
token notation string.
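
The token notation can be mirrored by a small record type, which may help when reading the table and the proofs. This is a sketch only: the class and field names are hypothetical, and the shared/private modifier is left out.

```python
from dataclasses import dataclass

REQUESTS = ("R", "W", "U")   # read, write, update (write-back)
ACKS = ("Ra", "Wa", "Ua")    # the corresponding acknowledgments


@dataclass(frozen=True)
class Token:
    processor: str  # "Pi" (attached), "Pj" (any other) or "Pk" (any)
    kind: str       # one of REQUESTS or ACKS
    address: str

    def __post_init__(self) -> None:
        if self.kind not in REQUESTS + ACKS:
            raise ValueError(f"unknown token kind: {self.kind}")

    @property
    def is_ack(self) -> bool:
        return self.kind in ACKS

    def __str__(self) -> str:
        return f"{self.processor} : {self.kind}@{self.address}"


print(Token("Pi", "R", "X"))          # Pi : R@X
print(Token("Pk", "Ua", "X").is_ack)  # True
```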

We use this notation extensively in the rest of this paper. The cache manager is then completely
specified by the transition diagram represented in figure 2 and the table of actions that
summarizes the operations triggered by various events. Nevertheless, the following verbal description
of the sequence of actions that take place for a given memory request helps in understanding
the overall behavior:

• Read hit: When a processor performs a read operation that hits in its private cache, the
cache simply sends to the processor an acknowledgment token that contains the desired
data (the data can be in the PC, PD, or S state). If the entry is in an intermediate
state (WA, WB, or WU) then the access is rejected, and the cache sends a rejection token
to the processor, which will then reissue the request later.

• Write hit: If the entry is in a private state, then it is updated with the new value and the
state is changed to PD if it was originally in the PC (clean) state. If the entry was in the S
(shared) state, then the write is broadcast by sending a token to the output fifo buffer and
entering the WB (wait to obtain the bus) state. At this point, if another acknowledgment
token with the same address stamp is detected on the bus, the corresponding value updates
the cache entry and an acknowledgment token is sent to the processor. The cache entry then
returns to the S (shared) state. If not, the request is eventually sent onto the bus and
the WA state is entered. Two things may then happen: either the acknowledgment token
appears on the bus, or an update acknowledgment token appears. The second case occurs
when the processor write was sent while an outstanding write-back was being processed.
In this case¹, the processor which was responsible for this update, and which therefore
was in the PD or WU state, has killed the regular acknowledgment token, and the update
acknowledgment is to be taken into account instead. As a result, the write operation by
our original processor will never be observed. In case the acknowledgment token does
appear on the bus, any processor with a different number which owns a copy of the
same address in a stable state (PC, PD, S) must notify this cache by turning the shared
signal on (several processors can do this concurrently). In this case, the cache entry gets
into the S state. If the shared wire is not turned on then the cache entry gets into the
PC state. Then an acknowledgment token is sent to the attached processor. Finally, if
the write hit occurs during an intermediate state (WA, WB, WU) then a rejection token
is simply sent to the processor.

• Read miss: If the corresponding entry is in a dirty state, then the replacement cannot
occur until the old cached data is written back. An update token is sent to the output
fifo buffer, the WU state is entered and a rejection token is sent back to the processor so
that it reissues its request later. While in the WU state, any request to the corresponding
address that appears on the bus is cancelled, until the update acknowledgment is received
and the S state entered. If the old data is in the PC or S state then the replacement can
occur: a read token is sent to the output buffer to request the data from the memory
and the WB state is entered. As in the previous case, if any acknowledgment token that
carries the same address stamp appears on the bus, it is accepted as acknowledgment and
the cache enters the S state. Otherwise, the token is sent on the bus and the WA state is
entered. The rest of the protocol is as for the write hit.

• Write miss: as for Read miss.
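
The processor side of the four cases above can be condensed into one transition function. The following is a minimal sketch under stated assumptions, not the complete automaton: bus events and acknowledgments are omitted, the action strings are informal labels, and treating the empty state E like a miss on a clean entry is an assumption made for the example.

```python
from enum import Enum
from typing import List, Tuple


class State(Enum):
    E = "empty"
    PC = "private clean"
    PD = "private dirty"
    S = "shared"
    WB = "wait for bus"
    WA = "wait for acknowledgment"
    WU = "wait for update acknowledgment"


WAITING = (State.WB, State.WA, State.WU)


def on_processor_request(state: State, op: str, hit: bool) -> Tuple[State, List[str]]:
    """op is "R" or "W"; hit is True when the entry caches the requested address.
    Returns the new state of the entry and the actions taken."""
    if state in WAITING:
        return state, ["reject"]
    if state is State.E or (not hit and state in (State.PC, State.S)):
        # Miss with nothing to write back: request the data over the bus.
        return State.WB, [f"send {op} token to output fifo"]
    if not hit:
        # Miss on a dirty entry: the old data must be written back first.
        return State.WU, ["send U token to output fifo", "reject"]
    if op == "R":
        return state, ["acknowledge with data"]
    if state is State.S:
        # Writes to shared data are broadcast.
        return State.WB, ["send W token to output fifo"]
    return State.PD, ["update entry", "acknowledge"]


print(on_processor_request(State.PC, "W", True))
print(on_processor_request(State.PD, "R", False))
```

The first call shows a write hit on a private clean entry (the entry becomes PD), the second a read miss that must first evict a dirty entry (the entry goes to WU and the request is rejected so that the processor reissues it later).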

¹ Such a strange case is possible because of buffering. Processor Pi was in the WB state, and meanwhile processor
Pj got the same entry from the memory but in the PC state because Pi, not being in a stable state, hadn't turned
the shared signal on. By the time the write token from Pi appeared on the bus, Pj had got into the PD state. Pj
is then responsible for updating the memory: it cancels Pi's request, gets the next bus cycle and sends an
update token bypassing its fifo buffer.

State columns: WU@X | WA@X | WB@X | S@X | PC@X | PD@X

Cells are listed from left to right for each group of events; a cell may span several state columns.

Pi : R@X
    read token sent to output fifo | rejection token sent to processor | data fetched from cache and token sent to processor

Pi : W@X
    write token sent to output fifo | rejection token sent to processor | write token sent to output fifo | cache updated and token sent to processor

Pi : R,W@X̄ (address requiring a replacement)
    read/write token sent to output fifo and cache updated with new address | rejection token sent to processor | read/write token sent to output fifo and cache updated with new address | rejection token sent to processor and update token sent to output fifo

Pj : R@X and Pj : W@X
    ignored | kill sent | ignored | ignored | ignored | state changes but cache data not updated

Pi : Ra,Wa@X and Pi : Ua@X
    impossible | cache updated and token sent to processor | impossible | cache updated and shared signal sent | impossible

Pj : Ra,Wa@X and Pj : Ua@X
    ignored | kill sent | ignored | cache updated, shared signal sent | kill sent | ignored | impossible | cache updated | impossible | cache updated

Table: Actions taken by the cache upon bus and processor events.


2.3  Correctness  proofs 

Most  currently  published  cache  protocols  are  not  formally  proved  because  of  their  simplicity. 
Nevertheless  our  cache  is  substantially  more  complex  than,  for  instance,  the  standard  snoopy 
cache,  and  therefore  one  can  think  of  many  complex  scenarios  that  might  fool  the  automaton. 
We first made sure that the automaton would never end up locked in a waiting state and for that
purpose proved the following lemma:

Correctness  lemma  1  (Deadlock  free  automaton)  For  each  transaction  token  sent  to  the 
output  fifo  buffer  there  is  a  unique  acknowledgment  token  that  travels  on  the  bus  (the  automaton 
will  therefore  never  stay  in  a  waiting  state) 

Proof:  The  proof  of  this  statement  is  done  by  studying  each  type  of  bus  transaction: 

• Pi : U@X is sent either directly onto the bus or to the fifo buffer, but it is guaranteed
to be sent on the bus. It then can only be acknowledged by the corresponding update
acknowledgment token.

• Pi : R,W@X is sent by the processor to the output fifo buffer. Then either (a) it does not
get to the bus, in which case it was acknowledged by a token stamped with the same
address; or (b) it does get to the bus but gets killed, in which case there is another processor
that started an update access and the corresponding acknowledgment token has not yet
appeared on the bus. An update token is always acknowledged, therefore the access will
be acknowledged as well; or (c) it does get to the bus and does get to the memory without
being killed, and the memory will then issue an acknowledgment token. This token is the
only one that will be accepted by the automaton: since the access was not killed, there
were no outstanding updates by the time the request was sent on the bus. Therefore, any
update acknowledgment that may appear on the bus corresponds to a subsequent update
request and, since the memory is a fifo server, will not appear on the bus until the regular
acknowledgment token is sent (no unexpected acknowledgment will appear on the bus).

Correctness lemma 2 (Correct state semantics) If a cache entry is in a private state (PC
or PD) then no other cache can be in a stable state (PC, PD or S) for the same location at the
same time (relative to the global clock provided by the bus cycles).

Proof:  Trivial  when  considering  the  cycle  at  which  the  cache  first  got  the  data  (which  necessarily 
involved  a  bus  access). 

Correctness lemma 3 (Sequential consistency) The system of caches is sequentially
consistent.

Proof: This proof, which relies on the formalism presented in the next section, is given in
section 3.4.


2.4  Further  improvements 

The  cache  that  we  described  up  to  now  supports  a  very  simplified  set  of  memory  instructions. 
In  particular  it  does  not  provide  any  Test-and-set  type  of  operations,  which  makes  it  inefficient 
when  supporting  a  programming  language  that  uses  storage  locations  for  synchronization  like 
Multilisp.  Furthermore  it  does  not  support  any  kind  of  memory  error  handling,  which  is  essential 
if  a  large  system  with  a  virtual  address  space  is  used.  We  have  nevertheless  investigated  what 
would  be  involved  in  modifying  our  cache  to  work  in  a  real  machine: 

•  Test-and-set instructions: in Multilisp such synchronization is performed with the use of
a Full/Empty bit that is attached to each memory location [7]. Standard Read/Write
instructions are then extended to a variety of operations such as “read if location is full” or
“write and leave location empty”. In most cases these operations can be treated
as a simple Read or Write operation, but they no longer allow “on the fly” optimizations.
The cache automaton should therefore be modified as if the WB state were projected onto
the WA state, and Reads that side-effect a memory location should be treated by the
cache manager as writes. The new system obtained is then write-ordered.
In fact, with added complexity in the cache hardware, “on the
fly” optimizations can still be performed, but this is beyond the scope of this paper.

•  Virtual  memory  features:  For  a  virtual-address  cache,  an  additional  acknowledgment  token 
must  be  introduced  and  recognized  by  the  automaton  in  order  to  treat  memory  errors.  The 
current  automaton  should  still  work  as  long  as  the  processor  that  swaps  memory  pages  in 
and  out  operates  on  the  same  bus  with  the  same  protocol  provided  that  memory  errors 
are  treated  by  the  cache  automaton  as  normal  acknowledgments. 
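
As a rough illustration of the Full/Empty-bit operations mentioned in the first item above, the following Python sketch models a single memory location. The class and method names are assumptions made for this example, and `write_and_set_full` is a hypothetical complementary operation that the text does not name.

```python
class FullEmptyCell:
    """One memory location with a Full/Empty bit (illustrative model only)."""

    def __init__(self):
        self.value = None
        self.full = False  # the Full/Empty bit attached to the location

    def read_if_full(self):
        """Return (True, value) when the location is full, else (False, None)."""
        if self.full:
            return True, self.value
        return False, None

    def write_and_leave_empty(self, value):
        """Store a value but leave the location marked empty."""
        self.value = value
        self.full = False

    def write_and_set_full(self, value):
        """Hypothetical producer-side write that marks the location full."""
        self.value = value
        self.full = True


cell = FullEmptyCell()
first_attempt = cell.read_if_full()    # the location is still empty
cell.write_and_set_full(42)
second_attempt = cell.read_if_full()   # the location is now full
```

Any operation that changes the bit modifies the location even when it is issued as a read, which is consistent with treating side-effecting Reads as writes.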




3  Memory  consistency,  a  theoretical  framework: 

For efficiency purposes, constraints on the ordering of transactions may be relaxed. This
leads to a more efficient use of the hardware (more parallelism), but it also makes the protocol
prone to design errors and therefore to incorrect program execution. The purpose of
this study is to review certain properties regarding the ordering of memory transactions in a
multiprocessor system, which are commonly invoked to satisfy the usual programming models.
A formalism, using partially ordered sets, is proposed to define cache coherence, sequential
consistency, and write ordering in a unified context. Along with this formulation, a theorem is
provided and proved, aimed at helping the designer prove that a given system satisfies any
of the above properties.

3.1  On  formalizing  memory  transactions 

Running a given parallel program on a multiprocessor system may lead to several different
executions. An execution is characterized by its Trace, which is the partially ordered set (possibly
infinite) of all memory transactions that are performed in the course of the run. Since we
are rarely interested in a single trace, a legal execution is often defined as a class of traces,
characterized by some constraints. These constraints have to do with both the sequential
semantics of the programming language (which defines the behavior of individual processors)
and its parallel semantics, which imposes constraints on the order of the operations in order
to preserve the sequential aspects (data dependencies, sequential sections) of the execution of
parallel programs. In this section we define several partial ordering relations on the set of all
transactions performed by the system. These orders reflect the usual dependencies between
memory instructions within subsets of all memory transactions (for example, the order at a
memory location). To achieve this goal, we first need to define the following objects:

•  $\mathcal{T}$ is the set of all memory transactions performed by the system.

•  $\mathcal{R}$ is the subset of Read transactions: $\mathcal{R} \subset \mathcal{T}$.

•  $\mathcal{W}$ is the subset of Write transactions: $\mathcal{W} \subset \mathcal{T}$.

•  $P_i$ denotes a Processor/Cache ensemble: $P_i \in \mathcal{P} = [0, Maxproc]$.

•  $A_j$ denotes a global memory address: $A_j \in \mathcal{A} = [0, Maxadr]$.

•  $\mathcal{V}$ denotes the set of all possible values that can be stored in a memory location.

•  $\Pi : \mathcal{T} \to \mathcal{P}$ is a function that canonically projects the set of transactions on the set of
processors. (Each transaction $X$ is performed by the unique processor $\Pi(X)$.)

•  $A : \mathcal{T} \to \mathcal{A}$ is a function that canonically projects the set of transactions on the set of
addresses. (Each transaction $X$ is performed on the unique address $A(X)$.)

•  $E : \mathcal{T} \to \mathcal{V}$ is a function that canonically projects the set of transactions on the set of
values. (Each transaction $X$ reads or writes the unique value $E(X)$.)

•  $\Omega_{P_i} : \mathcal{T} \to \{0,1\}$ defines a family of characteristic functions that model the fact that a
write transaction may or may not be observed² by the processor $P_i$.
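
To make these objects concrete, here is a minimal Python sketch of a transaction record together with the three projections and their inverse images. The field names, the `uid` field, and the sample trace are assumptions introduced for illustration only; they are not part of the formalism.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Transaction:
    """One memory transaction (field names are illustrative)."""
    uid: int                  # distinguishes otherwise identical transactions
    kind: Literal["R", "W"]   # Read or Write
    processor: int            # image under the projection onto processors
    address: int              # image under the projection onto addresses
    value: int                # value read or written


def pi(tx):
    """Projection of a transaction onto its processor."""
    return tx.processor


def addr(tx):
    """Projection of a transaction onto its address."""
    return tx.address


def val(tx):
    """Projection of a transaction onto its value."""
    return tx.value


def preimage(f, target, transactions):
    """Inverse image of `target` under f, kept in input order."""
    return [tx for tx in transactions if f(tx) == target]


# A small hypothetical trace.
T = [
    Transaction(0, "W", processor=1, address=0x10, value=2),
    Transaction(1, "W", processor=2, address=0x10, value=3),
    Transaction(2, "R", processor=1, address=0x10, value=3),
    Transaction(3, "W", processor=2, address=0x20, value=7),
]
R = [tx for tx in T if tx.kind == "R"]   # subset of Read transactions
W = [tx for tx in T if tx.kind == "W"]   # subset of Write transactions
```

For example, `preimage(pi, 1, T)` collects the transactions of processor 1, and `preimage(addr, 0x10, T)` collects those performed at address `0x10`.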

² An observation is to be defined by the designer. Nevertheless, it is assumed that all the transactions performed
by a given processor are observed by this same processor.

We assume that within a processor $P_i$ there is some kind of global synchronization (a single
clock, or all memory acknowledgments arrive via a single stream, or there is a single instruction
decode unit, etc.). Thus, even in a multi-threaded processor or in a processor with multiple
memory ports, we assume that only a single instruction can be completed at a given clock cycle.
$P_i$ executes instructions in some sequence. Therefore, it can be assumed that there is a total
order on the subset $\Pi^{-1}(P_i)$ of all memory transactions performed by $P_i$. We denote this total
order $<_{P_i}$.

A given location $A_j$ in memory observes the requests performed on it in some order. For
instance, in a system where a given address is always mapped to the same actual location (no
cache, for example), all the transactions involving $A_j$ are serialized at the memory cell, and
are therefore totally ordered. More often, however, this order is only a partial order (for
instance when caches maintain copies of the contents of the memory location). Let $<_{A_j}$ be
this order defined on $A^{-1}(A_j)$. This order is likely to be consistent with $<_{P_i}$ (if $\tau_1$ is performed
before $\tau_2$ by processor $P_i$, both at the address $A_j$, then $\tau_1 <_{A_j} \tau_2$), but this is not always the
case (for instance if the communication network does not preserve the order of messages).

Finally, there is an order derived from the programming language itself. The language allows
sequential sections, and therefore all the instructions within such a section are totally ordered.
In a language such as Multilisp [5], there is a notion of parent and child tasks, which also imposes
a hierarchy and therefore a partial order (transactions performed by a child are subsequent to,
that is greater than, those of the parent task that were performed before the task creation.
Transactions from the parent task that are subsequent to the child creation are greater than any
transaction performed by the child). Let $<_{prog}$ be this order defined on the entire set $\mathcal{T}$. The
above discussion can be summarized as follows:

•  $<_{P_i}$ is a total order on the transactions performed by the processor $P_i$.

•  $<_{A_j}$ is a partial order on the transactions performed at the address $A_j$.

•  $<_{prog}$ is a partial order on the transactions implied by the ordering of instructions within
the program.

3.2  Defining  cache  consistency 

In  this  section,  various  consistency  properties  are  studied.  Their  usual  definitions  are  interpreted 
and  reformulated  in  different  terms,  in  order  to  obtain  a  unified  formulation  of  these  properties. 
The  general  idea  is  that  a  consistency  property  is  equivalent  to  the  existence  of  some  order  on 
the  set  of  all  transactions  resulting  from  a  given  execution. 

Definition  1  (Cache  coherence)  A  system  of  caches  is  said  to  be  coherent  with  respect  to 
block  A  if  each  cache  in  the  system  observes  all  modifications  of  A  in  the  same  order. 

This  definition  given  in  [2]  relies  upon  the  meaning  of  “to  observe”  and  therefore  comes  with 
the  following  definition:  A  modification  to  a  block  A  is  observed  by  cache  Ci  if  any  subsequent 
read  by  processor  Pi  will  return  the  newly  written  value. 

This latter definition assumes an obvious notion of “subsequent” among the transactions,
which implies the existence of some global order. Combined with the definition of cache coherence,
this is circular, since cache coherence is itself about the existence of such a global order.
Moreover, the definition of cache coherence states that “all modifications” are observed, which
is not the case in most cache systems. As a matter of fact, in some cases each processor only
observes the order of its own transactions. Thus, we prefer the following alternate definition,
which derives from a reordering of words in the previous formulation while remaining faithful
to its spirit.

Definition 2 (Cache coherence revisited) A system of caches is said to be coherent with
respect to block $A$ if for any execution, there is a total order $<_A$ on the resulting set of all
memory transactions performed at $A$ that satisfies:

•  $\forall \tau_1, \tau_2 \in A^{-1}(A):\ (\exists i,\ \tau_1 <_{P_i} \tau_2) \Rightarrow \tau_1 <_A \tau_2$ (The order is compatible
with the order at each processor)

•  $\forall r \in A^{-1}(A) \cap \mathcal{R}:\ E(\sup\{w \in \mathcal{W} \cap A^{-1}(A) \mid w <_A r\}) = E(r)$ (A read in this
sequence returns the value of the most recent write)

In general this order is not unique. For instance, in a two-processor coherent system, an execution
may well give the following two traces³:

P1:W@X=2  P1:R@X=3  (Processor 1)
P2:W@X=3  P2:W@X=3  (Processor 2)

which may lead to several coherent total orders:

1/  P1:W@X=2  P2:W@X=3  P1:R@X=3  P2:W@X=3
2/  P2:W@X=3  P1:W@X=2  P2:W@X=3  P1:R@X=3
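
The two conditions of Definition 2 can be checked mechanically on this small example. The sketch below is a brute-force illustration, not a method taken from the paper: it enumerates every interleaving of the two traces and keeps those that respect each processor's order and in which the read returns the most recently written value. It assumes no initial value at X, so a read placed before any write is rejected.

```python
from itertools import permutations

# Transactions at address X, indexed 0..3: (processor, kind, value).
TRANSACTIONS = [("P1", "W", 2), ("P1", "R", 3), ("P2", "W", 3), ("P2", "W", 3)]
# Indices of each processor's transactions, in processor order.
PROCESSOR_SEQUENCES = [[0, 1], [2, 3]]


def respects_processor_order(order):
    """First condition: each processor's transactions keep their relative order."""
    position = {t: k for k, t in enumerate(order)}
    return all(position[a] < position[b]
               for seq in PROCESSOR_SEQUENCES
               for a, b in zip(seq, seq[1:]))


def reads_see_latest_write(order):
    """Second condition: every read returns the value of the most recent write."""
    last_written = None  # assumption: no initial value is defined at X
    for t in order:
        _, kind, value = TRANSACTIONS[t]
        if kind == "W":
            last_written = value
        elif value != last_written:
            return False
    return True


def coherent_orders():
    """All total orders on the four transactions that satisfy both conditions."""
    return [order for order in permutations(range(len(TRANSACTIONS)))
            if respects_processor_order(order) and reads_see_latest_write(order)]


def show(order):
    """Render an order in the P:T@A=V notation."""
    return "  ".join(f"{p}:{k}@X={v}" for p, k, v in (TRANSACTIONS[t] for t in order))
```

Under these assumptions the search returns three orders: the two listed above and a third one in which both writes of P2 are placed between the write and the read of P1.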

Definition 3 (Sequential consistency) A system is sequentially consistent if the result of
any execution is the same as if the operations of all processors were executed in some sequential
order, and the operations of each individual processor appear in this sequence in the order
specified by its program.

Sequential consistency is a property which relies upon the programming model, because it
involves “program order”, a notion that derives completely from the semantics of the
programming language. Any program that fits into this model may lead to several different
executions (some may terminate, some may not), but as long as the resulting trace can
be ordered such that there is a sequential execution which leads to the same ordered trace,
sequential consistency is satisfied. An alternate definition is:

Definition 4 (Sequential consistency revisited) A system is sequentially consistent if for
any execution, there is a total order $<_{seq}$ on the resulting set of all memory transactions that
satisfies:

•  $\forall \tau_1, \tau_2 \in \mathcal{T}:\ \tau_1 <_{prog} \tau_2 \Rightarrow \tau_1 <_{seq} \tau_2$ (The order is compatible with the order
imposed by the program)

•  $\forall r \in A^{-1}(A) \cap \mathcal{R}:\ E(\sup\{w \in \mathcal{W} \cap A^{-1}(A) \mid w <_{seq} r\}) = E(r)$ (A read in this sequence
returns the value of the most recent write at the same address)

Lemma 1 Definitions (3) and (4) are equivalent.

Proof: (1) Definition (3) implies definition (4): Definition (3) implies the existence of a total
order on the set of all instructions and therefore on the set of all memory transactions. This
order is compatible with program order since it derives from a sequential execution of the
program. For the same reason, the value returned by a read is the value written by the last write
in the sequential execution order.

³ Notation: P:T@A = V means that processor P performs the transaction T (Read or Write) at the address A,
resulting in the value V (being Read or Written).

(2) Definition (4) implies definition (3): In order to prove that the existence of the order
$<_{seq}$ implies the existence of a sequential execution of the program, we will construct this
execution by induction. We assume that instructions are also fetched from the same memory system
as data, although this is not necessary. The parallel run that leads to the totally ordered trace
$\mathcal{T}$ starts with the system in some state. Since $<_{seq}$ respects program order, the first read can
perfectly well be the first instruction executed in the sequential run. Suppose now that the sequential
execution that leads to the same trace $\mathcal{T}$ up to step $k$ is built; let us prove that this execution
can be extended to produce the transaction $T_{k+1}$. In the parallel run, $T_{k+1}$ was executed by
some processor $P_i$. Up to this point, processor $P_i$ was able to perform the same instructions
and the same memory transactions as in the parallel run. In particular, the task to which the
transaction $T_{k+1}$ belongs is in the same state as it was during the parallel run just before
performing it (the state of the task, which consists of the contents of some part of the memory and
some registers such as the program counter, depends only on the sequence of data it has fetched
from the memory up to the point where it is able to complete the next instruction). The new
program counter is calculated using previously read data within the current task. Therefore the
transaction $T_{k+1}$ can be performed. It will be performed at the same address with the same
value. Indeed, if this transaction is a write, both the address and the value written depend
only on the current state of the task, which is the same as for the parallel run. If this is a read
transaction, the value read is the value that was written the most recently in the $<_{seq}$ sequence.
Since our sequential execution has already performed the transactions prior to step $k$, the value
read will be the same as the one read during the parallel run. (One should note that this proof
does not require that the programming model be deterministic.)

Often, program order is the combination of two orders: (1) the order of sequential threads
and (2) a data-flow order (synchronization variables, child tasks created by putting a token in some
queue, etc.). The latter order relies only upon the values involved in the memory transactions.
Therefore the second condition that appears in definition (4) already guarantees that the
data-flow order is respected. Thus, as long as each processor in the system executes instructions
according to the order imposed by the sequential threads, program order is respected. For the
case where program order is composed of the above-mentioned two orders, definition (4) can be
reduced to:

Definition 5 A system is sequentially consistent if for any execution, there is a total order
$<_{seq}$ on the resulting set of all memory transactions that satisfies:

•  $\forall i,\ \forall \tau_1, \tau_2 \in \Pi^{-1}(P_i):\ \tau_1 <_{prog} \tau_2 \Rightarrow \tau_1 <_{seq} \tau_2$ (The order is compatible with
the order imposed by the program on the transactions performed by each processor)

•  $\forall r \in A^{-1}(A) \cap \mathcal{R}:\ E(\sup\{w \in \mathcal{W} \cap A^{-1}(A) \mid w <_{seq} r\}) = E(r)$ (A read in this sequence
returns the value of the most recent write at the same address)
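
For very small executions, Definition 5 can be tested by exhaustive search. The following sketch is only an illustration under stated assumptions: each processor runs a single sequential thread, every location initially holds the value `initial` (the definition itself does not fix an initial value), and the function name and the two sample executions are hypothetical.

```python
from itertools import permutations


def find_sequential_order(threads, initial=0):
    """Search for a total order satisfying both conditions of Definition 5.

    threads: one list per processor of (kind, address, value) tuples,
             given in that processor's program order.
    Returns a witness order as a list of (processor, index) pairs, or None.
    """
    tagged = [(p, k) for p, seq in enumerate(threads) for k in range(len(seq))]
    for order in permutations(tagged):
        next_index = [0] * len(threads)   # next expected transaction per processor
        memory = {}                       # address -> most recently written value
        valid = True
        for p, k in order:
            if k != next_index[p]:        # violates the processor's program order
                valid = False
                break
            next_index[p] += 1
            kind, address, value = threads[p][k]
            if kind == "W":
                memory[address] = value
            elif memory.get(address, initial) != value:
                valid = False             # the read does not see the latest write
                break
        if valid:
            return list(order)
    return None


# Both processors write one location, then read the other and see the old value.
not_consistent = [
    [("W", "X", 1), ("R", "Y", 0)],
    [("W", "Y", 1), ("R", "X", 0)],
]
# Same writes, but processor 0 now observes the write to Y.
consistent = [
    [("W", "X", 1), ("R", "Y", 1)],
    [("W", "Y", 1), ("R", "X", 0)],
]
```

The first execution admits no such order, because each read would have to precede the other processor's write while following its own. The second admits one, for instance with all of processor 1's transactions placed first. The search is factorial in the number of transactions, so it is only usable on toy traces.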

Sequential consistency does not imply cache coherence, contrary to what may appear at first. Indeed,
nothing proves that the order $<_{P_i}$ of each processor's transactions is respected by the global order
$<_{seq}$. For instance, in a multi-threaded processor system, $<_{seq}$ may perfectly well respect $<_{prog}$
but not the $<_{P_i}$.

Definition 6 (Write ordering) A system is write ordered if all caches observe all writes in
the same order.

Definition 7 (Write ordering revisited) A system is write-ordered if for any execution, there
is a total order $<_{write}$ on the resulting set of all memory transactions performed that satisfies:

•  $\forall \tau_1, \tau_2 \in \mathcal{T}:\ (\exists i,\ \tau_1 <_{P_i} \tau_2) \Rightarrow \tau_1 <_{write} \tau_2$ (The order is compatible with the
orders of each processor)

•  $\forall r \in A^{-1}(A) \cap \mathcal{R}:\ E(\sup\{w \in A^{-1}(A) \cap \mathcal{W} \mid w <_{write} r\}) = E(r)$ (A read in this
sequence returns the value of the most recent write)

Assuming that each processor executes instructions in program order, write ordering implies
sequential consistency. The converse is not necessarily true.

In this section we have assumed that observed order and processor order were the same. A
stronger meaning can be given to “observed order”, but it must always encapsulate processor
order. Definitions (2) and (6) can then be restated by replacing “processor order” by “observed
order”. The system will then be coherent or write ordered relative to this new notion of
observation.

3.3  On  Proving  Cache  Consistency 

The  computer  architect  can  usually  easily  prove  that  the  basic  processor  respects  program  order. 
Further  properties  of  consistency  may  be  difficult  to  prove.  The  purpose  of  this  section  is  to 
provide  a  technique  for  proving  consistency  properties.  The  computer  architect,  by  designing 
the  system,  imposes  some  constraints  on  the  way  that  memory  transactions  are  performed.  In 
particular,  by  putting  bottlenecks  (that  will  force  the  serialization  of  some  messages)  in  various 
places, partial orders are imposed on transactions. By choosing an appropriate “consistent” set
of  partial  orders,  it  is  sometimes  possible  to  build  a  total  order  that  satisfies  all  of  the  partial 
orders.  The  following  characterization  provides  a  necessary  and  sufficient  condition. 

Characterization 1 (Consistent set of partial orders) Given a set $T$ of objects and a finite
family $\{<_i\}$ of partial orders, there is a total order that respects all of them if and only if
no sequence $x_1 <_{i_1} x_2 <_{i_2} \ldots <_{i_k} x_{k+1}$ (a chain of objects where two adjacent elements are
comparable by some order) loops (that is, $x_{k+1} <_{i_{k+1}} x_1$ is impossible).⁴

Often a partial order that is considered is a total order on the subset of objects to which it
applies (for example, the order $<_{P_i}$ applies only to the transactions performed by $P_i$ but is a
total order on this subset of $\mathcal{T}$). If all the orders considered are such, then the following simpler
characterization can be used:

Characterization 2 If no sequence $x_1 <_{i_1} x_2 <_{i_2} \ldots <_{i_k} x_{k+1}$ in which each order $<_j$ appears
at most once is a loop, then there is a total order that respects each partial order.

The work of the designer is then reduced to imposing some orders in the system and then
finding a family of consistent partial orders. For instance, if program order is respected by
each processor and if cache coherence is already proved, write ordering can be proved simply
by showing that the family composed of $\{<_{P_i}\}$ (orders on the transactions of $P_i$) and $\{<_{A_j}\}$
(orders due to the coherence of the system relative to each address $A_j$) is consistent.
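
On a finite set, the condition of characterization 1 amounts to the union of the partial orders containing no cycle, and a total order can then be obtained by topological sorting. The sketch below illustrates this with Python's standard `graphlib` module (Python 3.9 or later). The function name, the encoding of each order as a set of `(smaller, larger)` pairs, and the sample orders are assumptions made for the example.

```python
from graphlib import CycleError, TopologicalSorter


def total_order(elements, partial_orders):
    """Return a total order respecting every partial order, or None on a loop.

    partial_orders: iterable of sets of (smaller, larger) pairs; generating
    pairs are enough, transitivity does not need to be spelled out.
    """
    sorter = TopologicalSorter({x: set() for x in elements})
    for order in partial_orders:
        for smaller, larger in order:
            sorter.add(larger, smaller)   # 'larger' must come after 'smaller'
    try:
        return list(sorter.static_order())
    except CycleError:                    # some chain of comparisons loops
        return None


# Hypothetical transactions a, b (processor 1) and c, d (processor 2).
processor_orders = [{("a", "b")}, {("c", "d")}]
address_orders_ok = [{("a", "c")}, {("d", "b")}]    # chain a < c < d < b
address_orders_bad = [{("b", "c")}, {("d", "a")}]   # loop a < b < c < d < a

consistent = total_order("abcd", processor_orders + address_orders_ok)
looping = total_order("abcd", processor_orders + address_orders_bad)
```

In the first family the four orders chain into a single sequence, so a total order exists. In the second, the two address orders close a loop through both processor orders and no total order can respect all of them.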

3.4  Proof  of  lemma  3 

We  first  prove  that  there  is  a  total  order  on  the  set  of  all  memory  transactions  which  is  compatible 
with  the  order  of  the  program: 

⁴ For a more precise formulation, see theorem (2).

•  We can define a bus order $<_{bus}$ using the order of the acknowledgment tokens. It is an
order on all the transactions that were sent onto the bus, but it can also be extended to all
the transactions that were sent into an output fifo buffer. Indeed, if several transactions
are acknowledged by the same token before they get onto the bus, they will simply be
considered as being simultaneous.

•  The order of each processor is defined as being the order of the acknowledgment tokens
that arrive in its input buffer. This order can be extended to local program order. In
fact, all write transactions that are acknowledged before they get onto the bus are never
observed, and the write that corresponds to the bus acknowledgment is observed instead.
We simply extend our order to insert the nonobserved write just before the observed one
(no read operation could have been performed in between). We call this extended order
$<_i$, which, since each processor executes each thread in program order, is compatible with
the program order.

It is easy to see that the bus order is compatible with each of these local orders ($T_1 <_{bus} T_2$ and
$T_2 <_i T_1$ is impossible). Further, it is also easy to see that any chain of three elements where each
order appears only once cannot be a loop. Indeed, suppose we have $T_1 <_1 T_2 <_2 T_3 <_3 T_1$. Then by
definition, $T_1$ and $T_2$ appeared in the same input buffer, and $T_2$ and $T_3$ both appeared in another
input buffer. Therefore, $T_2$ is a bus transaction, and so is $T_3$ (because $T_3$ and $T_1$ appeared in
a third input buffer). But the bus order can then compare these three transactions, and since
it is a total order it implies that $T_1 =_{bus} T_2 =_{bus} T_3$, that is, that all three transactions were
acknowledged by the same bus token. In exactly the same way we can prove that there is no
loop of length $k$; therefore there is a total order $<$ on the set of all memory transactions that is
compatible with each order $<_i$. This order $<$ is therefore compatible with program order.

The  second  point  that  should  be  proved  in  order  to  demonstrate  sequential  consistency  is 
that  in  the  sequence  defined  by  the  above-constructed  total  order,  each  read  returns  the  value 
of  the  most  recent  write.  This  is  proved  by  analyzing  each  state  in  which  the  entry  could  have 
been  at  the  time  (relative  to  the  bus  clock)  of  the  read.  We  proceed  by  cases: 

•  (PC) is the state of the entry: The automaton enters the PC state only after a bus
transaction. The value read is then the value that was written in the cache either by the
most recent (relative to $<_i$) acknowledgment token ($T_{a1}$) that came from the bus (in
which case it was in the WA state just before), or by the attached processor (in which case
it was in the PD state just before, and a token $T_1 = P_j : \ldots$ being detected on the bus
forced the automaton back to PC).

-  Assuming the first case, between the moment that $T_{a1}$ arrived and the time of the read,
no other acknowledgment stamped with the same address appeared on the bus
(otherwise, the automaton would not have stayed in the PC state). Therefore, $T_{a1}$ is also
the most recent acknowledgment relative to the $<$ order. Lemma (2) proves that,
relative to the bus clock, no other processor has written to this address since $T_{a1}$
traveled on the bus. The current read then reflects the last write ($T_{a0}$) in bus order
(since the memory acts as a fifo). Suppose that some other processor has performed
a write locally between $T_{a0}$ and $T_{a1}$ relative to $<$, putting its automaton into the
(PD) state. If the request token $T_1$ that corresponds to $T_{a1}$ was sent after this event,
then the write back would have worked because $P_j : R@X$ would have been detected,
and $T_{a0}$ would not be the most recent write. If $T_1$ was sent before and $T_{a1}$ sent after
the local write, then the write back would have also worked because $P_j : R_a, W_a@X$
would have been detected.

-  Assuming the second case, the value read would be the most recent one relative to
$<_i$, since the last recorded modification $W_0$ was performed by the attached processor
itself. In any case, as previously, the last acknowledgment token $T_{a1}$ in bus order
stamped with the corresponding address is the one that put the automaton into the
PC state before it went into the PD state. At the corresponding cycle no other
processor had a copy of this location, therefore any subsequent foreign writes would have
appeared on the bus. The only write that appeared on the bus (which is responsible
for the transition PD to PC) is not yet acknowledged, therefore the attached
processor is the only one that could have written the location since $T_{a1}$ in $<$ order. $W_0$ is
therefore the most recent write in $<$ order.

•  (PD) is the state of the entry: the proof is exactly as for the second case of the PC
state.

•  (S) is the state of the entry: the value read is then the one that was recorded in the cache
by the last bus acknowledgment token $T_{a1}$ in both bus order and processor order (the
automaton can enter the S state only after such a token). This token could have been either
an update acknowledgment token or a regular acknowledgment token.

-  Assuming the first case, $T_{a1}$ would then be the last acknowledged write in bus order.
Suppose that some other processor $P_j$ made a local write $W_0$. In $<_j$ order, $W_0$ was
preceded by a bus acknowledgment token $T_{a0}$ that is responsible for $P_j$'s automaton
entering the PC state. If we have $T_{a0} < T_{a1} < W_0$, then $T_{a1}$ would have put $P_j$'s
automaton into the S state and $W_0$ would not be a local write. If $T_{a1} < T_{a0}$, then $T_{a1}$
would not be the last acknowledged token in bus order. Therefore $T_{a1}$ is also the
last write in $<$ order.

-  Assuming the second case, $T_{a1}$ is the last acknowledged read token in bus
order (see above if it were a write token). The above proof shows that no local write
can lie between $T_{a1}$ and the current read in $<$ order. Further, at $T_{a1}$'s bus cycle,
another processor $P_j$ detected the token on the bus and, since the shared line
was activated, was in state PC or S. $P_j$'s automaton entered the state PC or S
upon a bus acknowledgment token $T_{a0}$ that is prior to $T_{a1}$ in bus order. We can
assume that this is the most recent token stamped with the same address prior to
$T_{a1}$; otherwise we would have just considered the corresponding processor. If $T_{a0}$
is a write acknowledgment token, it is obviously the most recent acknowledged write
in $<$ order (the proof is similar to what was just done above). $T_{a1}$ would therefore
report the value obtained from the memory, which is the one written by $T_{a0}$; the
proof is then complete in this case. If $T_{a0}$ is a read acknowledgment token, then we
can finish our proof by induction: suppose that we have proved that any read prior
to the current one in $<$ order returns the value of the most recent write. Then we can
apply the induction hypothesis to $T_{a0}$, and we know that $T_{a0}$ gets its value from the memory.
Since we already know that there are no writes between $T_{a0}$ and the current read in
$<$ order, we are sure that $T_{a1}$ also returns the same value as $T_{a0}$, which is the last
written value.


4  Conclusion

We  have  shown  in  this  paper  how  the  classical  Snoopy  cache  protocol  can  be  extended  to 
accommodate the needs of a multi-threaded processor connected to the memory via a
split-transaction bus. This was achieved by using buffered accesses which required adjoining one state
to  the  classical  automaton,  to  indicate  to  the  requesting  processes  that  an  access  is  currently 
being  performed.  Further,  the  non-atomicity  of  the  transactions  on  the  bus  required  another 
state  to  safely  perform  the  write-back  operations.  Finally,  for  the  sake  of  optimization  we  have 
added  another  state  that  allows  certain  buffered  accesses  to  be  acknowledged  without  being 
even  sent  on  the  bus  simply  by  taking  advantage  of  the  other  transactions  passing  on  the  bus. 
Our  contribution  to  cache  protocols  resides  in  having  studied  this  particular  combination  of 
processor/memory  requirements,  and  to  our  knowledge  this  is  the  first  published  work  that 
addresses  the  problem  of  efficient  caching  on  split-transaction  buses.  Nevertheless,  this  quite 
specific  cache  can  itself  be  rather  easily  extended  to  work  with  many  communication  media  that 
use  non-atomic  transactions.  In  fact,  in  the  course  of  our  project  we  have  adapted  our  cache 
automaton  to  work  with  a  communication  medium  that  is  a  combination  of  an  interconnection 
switching  network  and  a  split-transaction  bus  (for  broadcasts).  In  the  latter  case,  part  of  the 
state  information  was  kept  in  the  memory  (a  very  simple  case  of  a  directory  scheme). 

It  appeared  that  the  inflation  in  the  number  of  automaton  states,  combined  with  an  increased 
number  of  transition-triggering  events  (due  to  non-atomic  transactions)  made  the  whole  design 
fairly  complex,  and  therefore  vulnerable  to  many  errors.  The  need  for  satisfactory  proofs, 
particularly to convince ourselves that our system achieved reasonable consistency, led us to
reexamine  the  problem  of  cache  coherency.  We  found  that  the  classical  statements  of  consistency 
formulated  in  natural  language  were  subject  to  many  misinterpretations  as  the  system  to  which 
they  were  applied  got  more  complicated.  This  was  a  motivation  for  developing  a  more  uniform 
formulation,  based  on  partial  orders  on  the  set  of  memory  transactions,  that  we  used  to  re-state 
the  classical  definitions  of  various  consistency  properties.  We  found  this  formalism  useful  for 
understanding  how  to  serialize  the  operations  in  our  design,  and  it  turned  out  to  be  a  valuable 
guide for proving that our system was sequentially consistent.

6  Mathematical Appendix

Given a countable set on which several partial orders are defined, this section gives
a sufficient condition that guarantees the existence of a total order respecting all the partial
orders.

Definition 8  Let $X$ be a set. A thread is a pair $(\mathcal{D}, <)$ where $\mathcal{D}$ is a subset of $X$ and $<$ is
a total order on $\mathcal{D}$. $<$ has a canonical extension as a partial order on $X$, and $\mathcal{D}$ will be called
the domain of $<$.

Let $X$ be a countable set on which a finite number $k + 1$ of threads $(\mathcal{D}_i, <_i)$ are defined. We
will assume that $X = \bigcup_{i \in [0,k]} \mathcal{D}_i$.

Definition 9  A chain in $X$ is a finite enumeration⁵ $C : [0,m] \to X$ such that:

$\forall i \in [0, m-1] \ \exists j \in [0,k]$ such that $C(i) <_j C(i+1)$

$\mathcal{C}_m(X)$ is the set of chains of length $m + 1$, and $\mathcal{C}(X) = \bigcup_{m \in \mathbb{N}} \mathcal{C}_m(X)$ is the set of all chains of $X$.

Definition 10  $\mathcal{E}(X)$ is the set of all enumerations that respect chaining in $X$, and therefore
partial ordering, i.e. the set of all injections $I : \mathcal{N} \to X$ such that:

$\mathcal{N} \in \mathcal{P}(\mathbb{N})$ (power set of $\mathbb{N}$)

$\forall m \ \forall C \in \mathcal{C}_m(X), \ C(0), C(m) \in I(\mathcal{N}) \Rightarrow I^{-1}(C(0)) < I^{-1}(C(m))$ (respect chaining)
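As an illustration of Definition 10, the following Python sketch checks whether a finite enumeration respects chaining. It is only a sketch under simplifying assumptions that are not in the text: every thread is finite and is given as a list sorted by its own order, and the names `successor_edges`, `chained_from` and `respects_chaining` are hypothetical.

```python
def successor_edges(threads):
    """Map each element to its immediate successors in every thread."""
    edges = {}
    for thread in threads:
        for x, y in zip(thread, thread[1:]):
            edges.setdefault(x, set()).add(y)
    return edges


def chained_from(x, edges):
    """Return every y such that some chain starts at x and ends at y."""
    seen, stack = set(), [x]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def respects_chaining(enumeration, threads):
    """True if the list is injective and places C(0) before C(m) for every
    chain C whose two end points both appear in the list."""
    position = {x: n for n, x in enumerate(enumeration)}
    if len(position) != len(enumeration):
        return False  # an enumeration must be an injection
    edges = successor_edges(threads)
    for x in enumeration:
        for y in chained_from(x, edges):
            if y in position and position[y] <= position[x]:
                return False
    return True
```

Note that the intermediate elements of a chain need not belong to the enumeration: only its two end points are constrained, as in the definition.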

The following properties will then be considered:

Property 1 (No accumulation point)  $\forall P \subset X$ where $P \cap \mathcal{D}_i$ is infinite, $P$ is not upper
bounded relative to $<_i$:

$\forall x \in \mathcal{D}_i \ \exists p_i \in P$ such that $x <_i p_i$

This property leads to the fact that $\forall P \subset X$ where $P \cap \mathcal{D}_i \neq \emptyset$, there is a smallest element
in $P$ relative to $<_i$ ($\forall i \in [0,k] \ \exists p_i \in P$ such that $\forall p \in P \ \ p_i \le_i p$).

Property 2 (Coherence)  A chain cannot be a loop.

$\forall m \in \mathbb{N} \ \forall C \in \mathcal{C}_m(X)$ such that:

$\forall i \in [0, m-1] \ \exists j \in [0,k] \ / \ C(i) <_j C(i+1)$
then: $\forall l \in [0,k] \ \ C(m) \not<_l C(0)$
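For finite threads, the coherence property can be tested mechanically: a loop exists exactly when the graph of immediate successors contains a cycle. The sketch below is an illustration only; it assumes threads given as finite lists sorted by their own order, and `find_loop` is a hypothetical helper that returns one offending chain when there is one.

```python
def find_loop(threads):
    """Return a list [x, ..., x] describing a loop, or None if none exists."""
    successors = {}
    for thread in threads:
        for x in thread:
            successors.setdefault(x, [])
        for x, y in zip(thread, thread[1:]):
            successors[x].append(y)

    state = {}  # absent: unvisited, 1: on the current path, 2: finished
    for root in successors:
        if root in state:
            continue
        path = [root]                      # current depth-first path
        pending = [iter(successors[root])]  # successors still to explore
        state[root] = 1
        while path:
            for y in pending[-1]:
                if state.get(y) == 1:
                    # y is already on the path: the chain closes on itself
                    return path[path.index(y):] + [y]
                if y not in state:
                    state[y] = 1
                    path.append(y)
                    pending.append(iter(successors[y]))
                    break
            else:
                # every successor explored: backtrack
                state[path.pop()] = 2
                pending.pop()
    return None
```

With three threads ordering a before b, b before c and c before a, the function reports the loop a, b, c, a; with coherent threads it returns `None`.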

Property 3 (Connectivity)  $X$ is either a finite set, or satisfies the following property of
connectivity:

For any element $x$ in $X$, the set of elements to which $x$ is not chained is finite:

$\forall x \in X \ \ \{y \in X \ / \ \forall m \ \forall C \in \mathcal{C}_m(X) \cap \mathcal{E}(X)$ such that $C(0) = x$ then $C(m) \neq y\}$ is finite.

Theorem 1  Let $X$ be a countable set covered by a finite family of threads $\{(\mathcal{D}_i, <_i)\}_{i \in [0,k]}$ and
satisfying properties (1), (2) and (3). Then there exists a total order on $X$ that respects all the
partial orderings. This order defines a complete enumeration of $X$.
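When $X$ is finite, the total order whose existence the theorem asserts can be computed directly. The sketch below uses Kahn's topological sort, which is a standard technique substituted here for illustration and not the argument used in the proof. Threads are assumed to be finite lists sorted by their own order, elements are assumed to be mutually comparable (only to break ties deterministically), and `total_order` is a hypothetical name.

```python
import heapq


def total_order(threads):
    """Return a list enumerating every element consistently with all
    threads, or None when the threads contain a loop."""
    successors, indegree = {}, {}
    for thread in threads:
        for x in thread:
            successors.setdefault(x, set())
            indegree.setdefault(x, 0)
        for x, y in zip(thread, thread[1:]):
            if y not in successors[x]:
                successors[x].add(y)
                indegree[y] += 1

    # Elements with no predecessor left can be enumerated next.
    ready = [x for x, d in indegree.items() if d == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        x = heapq.heappop(ready)
        order.append(x)
        for y in successors[x]:
            indegree[y] -= 1
            if indegree[y] == 0:
                heapq.heappush(ready, y)

    # If some element was never released, property (2) is violated.
    return order if len(order) == len(indegree) else None
```

Every thread appears as a subsequence of the returned list, which is what "respects all the partial orderings" means in the finite case.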

5  An enumeration is an injection which maps a subset of $\mathbb{N}$ (set of natural numbers) onto a subset of $X$.

Proof: Lemmas (1), (2) and (3) (see below) prove this theorem.

One can canonically define an order $\prec$ on $\mathcal{E}(X)$ that encapsulates all the partial orders $<_i$
defined on $X$:

$\forall I, J \in \mathcal{E}(X)$, $I \prec J$ if and only if:

-  $\mathrm{Image}(I) \subset \mathrm{Image}(J)$

-  The enumeration $J$ respects the order defined by $I$

($i < j \Rightarrow J^{-1}(I(i)) < J^{-1}(I(j))$)
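The two conditions defining this order on enumerations translate into a short test for finite enumerations. In this sketch an enumeration is assumed to be represented by the list of its values in increasing index order, and `precedes` is a hypothetical name.

```python
def precedes(I, J):
    """True if Image(I) is included in Image(J) and J lists the elements
    of I in the same relative order as I does."""
    position_in_J = {x: n for n, x in enumerate(J)}
    if any(x not in position_in_J for x in I):
        return False  # Image(I) is not included in Image(J)
    ranks = [position_in_J[x] for x in I]
    # J respects I when the ranks are strictly increasing.
    return all(a < b for a, b in zip(ranks, ranks[1:]))
```

For example, the enumeration a, c precedes a, b, c, whereas c, a does not.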

Lemma 1  If the above properties (1), (2) and (3) are satisfied, then $\mathcal{E}(X)$ is inductive⁶.

Proof:

Let $S \subset \mathcal{E}(X)$ be a filtering subset.
Let $X_S = \bigcup_{s \in S} \mathrm{Image}(s) \subset X$.

The following properties are true:

(a)  There exists a total order $<_S$ on the set $X_S$ compatible with all enumerations of $S$.

Proof: $\forall a, b \in X_S \ \exists e_a, e_b \in S \ / \ a \in \mathrm{Image}(e_a), \ b \in \mathrm{Image}(e_b)$, therefore

$\exists e \in S$ such that $e_a \prec e$ and $e_b \prec e$ (because $S$ is a filtering subset) and
$\forall f \in S$ such that $a, b \in \mathrm{Image}(f)$: $f^{-1}(a) < f^{-1}(b) \Leftrightarrow e^{-1}(a) < e^{-1}(b)$

because: $\exists g \in S$ such that $f \prec g$ and $e \prec g$.
One can then define: $a <_S b \Leftrightarrow e^{-1}(a) < e^{-1}(b)$

(this relation is obviously an order)

(b)  For any subset $X'_S$ of $X_S$ there is a smallest element relative to $<_S$.

Proof: $\forall i \in [0,k]$ such that $X'_S \cap \mathcal{D}_i \neq \emptyset$, $\exists m_i = \mathrm{Min}_{<_i}(X'_S \cap \mathcal{D}_i)$ (by prop. (1))

Because the set of $m_i$ is finite and $<_S$ is total, $\exists m = \mathrm{Min}_{<_S}(m_i)$.
$m$ is the smallest element of $X'_S$ relative to $<_S$. Indeed:

$\forall x \in X'_S \ \exists i \in [0,k]$ such that $x \in \mathcal{D}_i \Rightarrow x \ge_i m_i \Rightarrow x \ge_S m_i \ge_S m$

Properties (a) and (b) allow us to define by induction an enumeration $E$ of $X_S$ as follows:

$E(0) = \mathrm{Min}_{<_S}(X_S)$ and $E(k+1) = \mathrm{Min}_{<_S}(X_S - \{E(0), E(1), \ldots, E(k)\})$

By construction, the enumeration $E$ respects chaining.

It is obvious that $\mathrm{Image}(E) \subset X_S$; let us show that $\mathrm{Image}(E) = X_S$.
Suppose $\exists x \in X_S$ such that $x \notin \mathrm{Image}(E)$; then $\mathrm{Image}(E)$ is infinite

(if it were finite, then because of the enumeration method $\mathrm{Image}(E)$ would be equal to $X_S$).

By property (3), $\exists y \in \mathrm{Image}(E)$ to which $x$ is chained.

Since $S$ is filtering, $\exists s \in S \ / \ x, y \in \mathrm{Image}(s)$.

Since $s$ respects chaining, $x <_S y$ and therefore $x \in \mathrm{Image}(E)$.

$\forall s \in S \ \ s \prec E$. Indeed:

$\mathrm{Image}(s) \subset X_S = \mathrm{Image}(E)$ and,

$\forall a, b \in \mathrm{Image}(s)$ such that $s^{-1}(a) < s^{-1}(b) \Rightarrow a <_S b$ and therefore:

$E^{-1}(a) < E^{-1}(b)$ (because of the construction of $E$)

Therefore $S$ is upper bounded.

Lemma 2  $\mathcal{E}(X)$ has a maximal element.

6  A partially ordered set $\mathcal{E}$ is said to be inductive if any filtering subset is upper bounded. A subset $S$ of $\mathcal{E}$ is said to be
filtering if for any two elements $x$ and $y$ in $S$, there is a third element $z$ in $S$ that is greater than both $x$ and $y$.

Proof: by Zorn's theorem⁷.

Lemma 3  There is a complete enumeration of $X$ which is maximal relative to the order $\prec$
defined on $\mathcal{E}(X)$.

Proof: Let $E$ be the enumeration obtained in the above lemma. We will prove that $E(\mathbb{N}) = X$.
Suppose $\exists x_0 \in X$ such that $x_0 \notin E(\mathbb{N})$. Then:

Property (1) guarantees that: $\exists M \ / \ \forall i \in [0,k] \ \forall a > M \ \ E(a) \not<_i x_0$

Moreover $\exists m \in [0, M] \ / \ \forall i \in [0,k] \ \forall x \in E([0,M]) \cup \{x_0\} \ \ E(m) \not>_i x$
Proof: If not, then (*) $\forall m \ / \ \exists i \in [0,k] \ \exists y \in E([0,M]) \cup \{x_0\} \ / \ E(m) >_i y$
We can then construct by induction an infinite chain as follows:

Let $y_0$ be an arbitrary element of $E([0,M]) \cup \{x_0\}$ and define $y_n$ from $y_{n-1}$ as:

$y_{n-1} >_{a_n} y_n$ ((*) guarantees the existence of $y_n$)

Since the set $E([0,M]) \cup \{x_0\}$ is finite, this infinite chain has to be a loop.

This is in contradiction with property (2).

We can now construct an enumeration $E'$ of $E([0,M]) \cup \{x_0\}$ that respects chaining, by induction:

Let $E'(0)$ be such that $\forall i \in [0,k] \ \forall x \in E([0,M]) \cup \{x_0\} \ \ E'(0) \not>_i x$

And $E'(n)$ be such that $\forall i \in [0,k] \ \forall x \in E([0,M]) - E'([0,n-1]) \cup \{x_0\} \ \ E'(n) \not>_i x$

(If there are several possibilities, $E'(n)$ will be chosen to respect the order defined by $E$):

$E(\min(E^{-1}(\{y \ / \ \forall i \in [0,k] \ \forall x \in E([0,M]) - E'([0,n-1]) \cup \{x_0\} \ \ y \not>_i x\})))$

$E'$ can be extended beyond $[0, M+1]$ by taking the order defined by the enumeration $E$ on
$E([M, \infty[)$. $E'$ respects chaining as well, and is strictly greater than $E$, which is impossible.

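Theorem 2 (Theorem 1 bis)  Let $X$ be a countable set covered by a finite family of threads
$\{(\mathcal{D}_i, <_i)\}_{i \in [0,k]}$ satisfying properties (1) and (2). Then there exists a total order on $X$ that
respects all the partial orderings. This order defines a complete enumeration of $X$.

Proof:⁸ The proof of this theorem is done by induction on the number $k$ of threads involved.
If there is only one thread, the theorem is trivial. Let us suppose that theorem (2) is proved for $k$
threads. The following two cases are to be considered:

(1)  $\forall i, j \in [0,k]$, $i \neq j$, $\mathcal{D}_i \cap \mathcal{D}_j$ is finite

We can then define for all $i \in [0,k]$, $m_i \in \mathcal{D}_i$ such that: $m_i = \max_{<_i} \bigcup_{j \in [0,k], j \neq i} (\mathcal{D}_i \cap \mathcal{D}_j)$

Let $\mathcal{D}_{>m_i} = \{x \in \mathcal{D}_i \ / \ x >_i m_i\}$ and $\mathcal{D}_{\le m_i} = \{x \in \mathcal{D}_i \ / \ x \le_i m_i\}$

Since $<_i$ is a total order on $\mathcal{D}_i$ we have: $\mathcal{D}_{>m_i} \cup \mathcal{D}_{\le m_i} = \mathcal{D}_i$ and $\mathcal{D}_{>m_i} \cap \mathcal{D}_{\le m_i} = \emptyset$

Let $X_> = \bigcup_{i \in [0,k]} \mathcal{D}_{>m_i}$ and $X_{\le} = \bigcup_{i \in [0,k]} \mathcal{D}_{\le m_i}$; we have:

(a)  $X_> \cup X_{\le} = X$ ($X = \bigcup_{i \in [0,k]} \mathcal{D}_i = \bigcup_{i \in [0,k]} (\mathcal{D}_{>m_i} \cup \mathcal{D}_{\le m_i}) \subset (X_> \cup X_{\le})$)

(b)  $X_> \cap X_{\le} = \emptyset$. Indeed:

If $x \in X_> \cap X_{\le}$ then $\exists i, j \in [0,k]$ such that: $x \in \mathcal{D}_{>m_i} \cap \mathcal{D}_{\le m_j}$

Then $x \in \mathcal{D}_i \cap \mathcal{D}_j$ and therefore $x \le_i m_i$ and $x \le_j m_j$,
which is in contradiction with the fact that $x \in \mathcal{D}_{>m_i}$.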
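The split of each thread at its last shared element can be illustrated on a finite example. The sketch below makes assumptions that are not in the text: threads are finite lists sorted by their own order, a thread sharing no element with the others is given an empty lower part, and `split_threads` is a hypothetical name.

```python
def split_threads(threads):
    """Cut every thread just after its last element shared with another
    thread. Returns (lower parts, upper parts), one pair per thread."""
    lower, upper = [], []
    for i, thread in enumerate(threads):
        # Elements that belong to at least one other thread.
        others = set().union(*(t for j, t in enumerate(threads) if j != i))
        shared = [n for n, x in enumerate(thread) if x in others]
        # m_i is the last shared element; the cut falls right after it.
        cut = shared[-1] + 1 if shared else 0
        lower.append(thread[:cut])
        upper.append(thread[cut:])
    return lower, upper


lower, upper = split_threads([[1, 2, 3, 4, 5], [0, 3, 6, 7]])
x_lower = set().union(*lower)   # plays the role of the finite lower set
x_upper = set().union(*upper)   # plays the role of the upper set
```

In this example the two resulting sets are disjoint and together cover every element, which is what points (a) and (b) state.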


7  Any inductive set has a maximal element.

8  A constructive proof for this theorem can also be done.

Since $X_{\le}$ is finite, we can apply theorem (1) to $X_{\le}$ with the family of threads
$\{(X_{\le} \cap \mathcal{D}_i, <_i)\}_{i \in [0,k]}$. Therefore there is a total order, and an enumeration on $X_{\le}$, that respects
all the partial orders. This finite enumeration can be completed with a standard diagonal
enumeration⁹ of $X_>$ (thanks to the fact that the threads form a partitioning of $X_>$). Because
of property (1), this enumeration reaches all the elements of $X_>$.
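The diagonal enumeration described in footnote 9 can be sketched as a Python generator that also works on infinite threads. The threads are assumed to be pairwise disjoint iterables, each yielding its elements in increasing order; `diagonal_enumeration` is a hypothetical name.

```python
from itertools import count, islice


def diagonal_enumeration(threads):
    """Yield the first element of each thread, then the second element of
    each thread, and so on, dropping threads as they run out."""
    iterators = [iter(thread) for thread in threads]
    while iterators:
        alive = []
        for it in iterators:
            try:
                item = next(it)
            except StopIteration:
                continue  # this thread is exhausted
            yield item
            alive.append(it)
        iterators = alive


evens = (2 * n for n in count())       # infinite thread 0, 2, 4, ...
odds = (2 * n + 1 for n in count())    # infinite thread 1, 3, 5, ...
first = list(islice(diagonal_enumeration([evens, odds, ["x", "y"]]), 8))
```

Any element at position n of its thread is produced after at most (n + 1) times the number of threads steps, so no element is postponed forever.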

(2)  $\exists i, j \in [0,k]$, $\mathcal{D}_i \cap \mathcal{D}_j$ is infinite

There is an enumeration of $\mathcal{D}_i \cap \mathcal{D}_j$ that respects both $<_i$ and $<_j$.

Proof: Let $d_{i,0} = \min_{<_i} (\mathcal{D}_i \cap \mathcal{D}_j)$ and $d_{j,0} = \min_{<_j} (\mathcal{D}_i \cap \mathcal{D}_j)$

(property (1) ensures the existence of both these smallest elements)

Then $d_{i,0} \le_i d_{j,0} \le_j d_{i,0}$ and therefore by property (2): $d_{i,0} = d_{j,0}$

Let $d_{ij,0} = d_{i,0}$; we can then define by induction the following enumeration:

$d_{ij,n+1} = \min_{<_i} ((\mathcal{D}_i \cap \mathcal{D}_j) - \{d_{ij,0}, d_{ij,1}, \ldots, d_{ij,n}\})$

By property (1) again, this enumeration reaches any element of $\mathcal{D}_i \cap \mathcal{D}_j$ and by
construction it respects both orders. We will denote $d_{ij,n}$ by $d_n$.
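On finite threads, the enumeration of the common elements amounts to listing the intersection in the order of one thread and checking that the other thread agrees. The sketch below assumes threads given as finite lists sorted by their own order; `common_enumeration` is a hypothetical name.

```python
def common_enumeration(thread_i, thread_j):
    """Enumerate the elements shared by two threads in an order that
    respects both of them; raise if the two orders disagree."""
    common = set(thread_i) & set(thread_j)
    by_i = [x for x in thread_i if x in common]  # repeated minimum for <_i
    by_j = [x for x in thread_j if x in common]  # repeated minimum for <_j
    if by_i != by_j:
        # Two shared elements ordered in opposite ways form a loop,
        # which property (2) forbids.
        raise ValueError("the threads order their common elements differently")
    return by_i
```

For instance, the threads 1, 2, 3, 5, 8 and 0, 2, 5, 9 share the elements 2 and 5, which both threads list in the same order.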

Let us define the following subsets of $\mathcal{D}_{ij} = \mathcal{D}_i \cup \mathcal{D}_j$:

$[d_l, d_{l+1}[_{<_i} = \{x \in \mathcal{D}_i \ / \ d_l \le_i x <_i d_{l+1}\}$

$[d_l, d_{l+1}[_{i \to j} = \{x \in X \ / \ \exists m \in \mathbb{N} \ / \ \exists C_m \in \mathcal{C}(X) \cap \mathcal{E}(X) \ /$

$C_m(0) \in [d_l, d_{l+1}[_{<_i}$ and $C_m(m) \in [d_l, d_{l+1}[_{<_j}$ and $x \in C_m([0,m])\}$

$[d_l, d_{l+1}[ = [d_l, d_{l+1}[_{<_i} \cup [d_l, d_{l+1}[_{<_j} \cup [d_l, d_{l+1}[_{i \to j} \cup [d_l, d_{l+1}[_{j \to i} \cup [d_l, d_{l+1}[_{i \to i} \cup [d_l, d_{l+1}[_{j \to j}$

(For any two elements of $[d_l, d_{l+1}[$, any chain that connects them is also contained in $[d_l, d_{l+1}[$.)
The above defined set $[d_l, d_{l+1}[$ is finite:

Proof: By property (1), $[d_l, d_{l+1}[_{<_i}$ is finite. Let us show that the set of chains which respect
the orders and end in $[d_l, d_{l+1}[_{<_i}$ is finite. Let $x_0$ be an element of $[d_l, d_{l+1}[_{<_i}$.

If the set of chains ending at $x_0$ were infinite, then there would be an infinite
set $x_{0<}$ of elements of $X$ chained to $x_0$.

Therefore, $\exists a \in [0,k]$ such that $x_{0<} \cap \mathcal{D}_a$ is infinite, and consequently $x_0$ is an
accumulation point. This is in contradiction with property (1). This proves that
the set $x_{0<}$ is finite and so are $[d_l, d_{l+1}[_{i \to j}$ and $[d_l, d_{l+1}[$.

$\mathcal{D}_{ij} \subset \overline{\mathcal{D}}_{ij} = \bigcup_{l \in \mathbb{N}} [d_l, d_{l+1}[$

Properties (1), (2) and (3) are satisfied by $\overline{\mathcal{D}}_{ij}$ with the family of threads $\{(\overline{\mathcal{D}}_{ij} \cap \mathcal{D}_a), <_a\}_{a \in [0,k]}$.
Proof: Properties (1) and (2) are directly derived from the hypotheses.

Let $x \in \overline{\mathcal{D}}_{ij}$; $\exists \alpha$ such that $x \in [d_\alpha, d_{\alpha+1}[$

Then $x$ is chained to any $y \in [d_\beta, d_{\beta+1}[$ where $\beta > \alpha$
($x$ is chained to $d_{\alpha+1}$ which is chained to $d_\beta$ which is chained to $y$)

Therefore the set of elements to which $x$ is not chained is included in
$\bigcup_{l \le \alpha} [d_l, d_{l+1}[$, which is finite. This proves property (3).


9  We start by numbering the first element of each thread, and then the second element
of each thread, etc.

Theorem (1) ensures the existence of a total order $<_{ij}$ on $\overline{\mathcal{D}}_{ij}$ that respects all partial orders
and which defines an enumeration. In order to bring the proof to completion we now need to
apply the induction hypothesis to the set $X$ with the threads $\{(\overline{\mathcal{D}}_{ij}, <_{ij}), (\mathcal{D}_l, <_l)_{l \in [0,k] - \{i,j\}}\}$.

Property (1) is satisfied by hypothesis for $\mathcal{D}_l$, $\forall l \in [0,k] - \{i,j\}$, but also for $(\overline{\mathcal{D}}_{ij}, <_{ij})$ because
of the enumeration defined by the order $<_{ij}$.

Property (2) stands for $[X ; \{(\overline{\mathcal{D}}_{ij}, <_{ij}), (\mathcal{D}_l, <_l)_{l \in [0,k] - \{i,j\}}\}]$.

Proof: Suppose there is a loop $C_m$. One can assume that $C_m$ has the smallest possible
length. Then, each order relation $<_a$ appears only once in the chain. Indeed:

If $\exists k, l \in [0,m]$, $k < l$ such that $C_m(k) <_a C_m(k+1)$ and $C_m(l) <_a C_m(l+1)$

(a)  if $C_m(k) <_a C_m(l+1)$, then the chain obtained from $C_m$ by removing
the elements $C_m(k)$ to $C_m(l+1)$ is also a loop and is shorter.

(b)  if $C_m(k) >_a C_m(l+1)$, then the chain obtained from $C_m$ by keeping
only the elements $C_m(k+1)$ to $C_m(l+1)$ is also a loop and is shorter.

$<_{ij}$ is involved in the loop, otherwise $C_m$ would be a loop in $(X, (\mathcal{D}_l, <_l)_{l \in [0,k]})$.
Therefore, $\exists! l \in [0,m]$ such that $C_m(l) <_{ij} C_m(l+1 \bmod m)$

Since $C_m(l), C_m(l+1) \in \overline{\mathcal{D}}_{ij}$, then
$\exists a, b \ / \ C_m(l) \in [d_a, d_{a+1}[$ and $C_m(l+1) \in [d_b, d_{b+1}[$

(a)  $a = b$ is impossible, because the enumeration of $\overline{\mathcal{D}}_{ij}$ which was constructed with
theorem (1) respects all the chains that connect two elements of $[d_a, d_{a+1}[$.

(b)  $a < b$ is impossible, because then $C_m(l)$ would be chained to $C_m(l+1)$ via orders
chosen from the original set of orders: $(<_l)_{l \in [0,k]}$

Therefore $C_m$ would be also a loop in $(X, (\mathcal{D}_l, <_l)_{l \in [0,k]})$.

(c)  $a > b$: As case (b).